main.py

This is the file where we show all of the graphs and analyses we've performed on our datasets.

Importing Data and Modules

Here, we load all of our data and modules that we're going to use.

Data Cleanup

Here, we check for duplicates, empty values etc. in our Accounts and Posts dataframes. We do find one duplicate user in Accounts, so we remove that user, remove them from every followers list, and recompute the affected numbers.
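The cleanup step above can be sketched in plain Python. This is only an illustration: the record layout (`id`, `followers`, `n_followers`) and the helper `remove_account` are assumptions made for the example, not the notebook's actual column names, and we assume the duplicate record carries its own id.

```python
def remove_account(accounts, account_id):
    """Drop one account, strip it from every followers list, recompute counts."""
    cleaned = []
    for record in accounts:
        if record["id"] == account_id:
            continue  # skip the duplicate record itself
        followers = [f for f in record["followers"] if f != account_id]
        cleaned.append(
            {"id": record["id"], "followers": followers, "n_followers": len(followers)}
        )
    return cleaned


# Toy data (hypothetical): "c_dup" stands in for the duplicate user.
accounts = [
    {"id": "a", "followers": ["b", "c", "c_dup"], "n_followers": 3},
    {"id": "b", "followers": ["c_dup"], "n_followers": 1},
    {"id": "c", "followers": ["a"], "n_followers": 1},
    {"id": "c_dup", "followers": ["a"], "n_followers": 1},
]
cleaned_accounts = remove_account(accounts, "c_dup")
```

The follower count is recomputed from the cleaned list rather than decremented, so it stays correct even if the duplicate appeared more than once in a list.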

Posts data, on the other hand, was far nicer.

Exporting our data

Exploratory Data Analysis

In this section, we can take a look at statistical properties of our data.

Accounts

Posts

Graph Construction

This section constructs the graphs of Accounts and Posts that we'll use in the metrics to follow.

Accounts Graph Construction

Conclusion: Our users have broad connections, not deep ones.

Posts Graph Construction

Here we construct the graph for the posts. Each node is identified by its index in the dataframe, and the post's attributes are stored as node attributes. All edges are then added between these indexed nodes.
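A minimal sketch of this construction, using plain dictionaries instead of a graph library. The `reposted_from` field is a hypothetical column introduced for the example; we only assume that each repost somehow records the index of the post it came from, so that edges point from the original post to its repost.

```python
def build_posts_graph(rows):
    """Return (nodes, edges): nodes map index -> attributes, edges map index -> reposts."""
    nodes = {index: dict(row) for index, row in enumerate(rows)}
    edges = {index: [] for index in nodes}
    for index, row in nodes.items():
        parent = row["reposted_from"]  # assumed field: index of the source post
        if parent is not None:
            edges[parent].append(index)  # directed edge: poster -> reposter
    return nodes, edges


# Toy rows (hypothetical): post 0 is an original, the others are reposts.
posts = [
    {"author": "a", "likes": 10, "reposted_from": None},
    {"author": "b", "likes": 3, "reposted_from": 0},
    {"author": "c", "likes": 1, "reposted_from": 0},
    {"author": "d", "likes": 0, "reposted_from": 1},
]
post_nodes, post_edges = build_posts_graph(posts)
```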

Metrics

Let's compute our KPIs!

Metric 1: Interactivity

Here, we implement the first metric proposed in the first deliverable. This measures the number of likes, clicks, reposts, donations etc. that each post has. We can use this to create a composite ranking, which then provides us a KPI to maximise.
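One possible way to turn these counts into a composite ranking is a weighted sum per post. The weights below are purely illustrative assumptions, as are the field names and the helpers `interactivity_score` and `rank_posts`; the deliverable may define the composite differently.

```python
# Assumed weights: rarer, more valuable interactions count for more.
WEIGHTS = {"likes": 1.0, "clicks": 2.0, "reposts": 3.0, "donations": 5.0}


def interactivity_score(post, weights=WEIGHTS):
    """Weighted sum of a post's interaction counts (missing fields count as 0)."""
    return sum(weight * post.get(field, 0) for field, weight in weights.items())


def rank_posts(posts, weights=WEIGHTS):
    """Sort posts from most to least interactive."""
    return sorted(posts, key=lambda post: interactivity_score(post, weights), reverse=True)


# Toy posts (hypothetical numbers).
posts = [
    {"post_id": 0, "likes": 10, "clicks": 2, "reposts": 1, "donations": 0},
    {"post_id": 1, "likes": 3, "clicks": 4, "reposts": 2, "donations": 1},
    {"post_id": 2, "likes": 1, "clicks": 0, "reposts": 0, "donations": 0},
]
ranking = [post["post_id"] for post in rank_posts(posts)]
```

With these weights, post 1 outranks post 0 despite having fewer likes, because its clicks, reposts and donation weigh more.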

Weak Components

We expect 4 components, one for each of the 4 original posts used to seed our userbase. And that's what we get! We look for weak components because we're working with a Directed Acyclic Graph, so we need to ignore the direction of our edges to find components.
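To make "ignoring the direction of our edges" concrete, here is a small sketch that finds weak components with a breadth-first search over an undirected view of the graph. The function name and the toy edge list are assumptions for illustration only.

```python
from collections import deque


def weakly_connected_components(nodes, directed_edges):
    """Group nodes into components, treating every directed edge as undirected."""
    neighbours = {node: set() for node in nodes}
    for source, target in directed_edges:
        neighbours[source].add(target)
        neighbours[target].add(source)  # add the reverse link to drop direction

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return components


# Toy graph (hypothetical): a chain, a pair, and two isolated posts.
components = weakly_connected_components(range(7), [(0, 1), (1, 2), (3, 4)])
```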

Metric 2: Reachability

Metric 2a: Visibility

Here, we implement a metric proposed as part of the first deliverable, visibility. This measures the total number of views that the campaign received.

Metric 2b: Virality

Virality is the speed at which the campaign propagated. We capture this notion of speed with the diameter of each connected component. The diameter is inversely related to the speed of the campaign, as it denotes how many degrees of separation lie between the source node and the "furthest" node.
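As a rough sketch, the diameter of a component can be computed by running a breadth-first search from every node and keeping the largest distance found. We assume here that the component is given as an undirected adjacency mapping (direction ignored, as for weak components); the helper names and the toy graph are illustrative.

```python
from collections import deque


def bfs_distances(neighbours, start):
    """Number of hops from start to every node reachable from it."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in distances:
                distances[other] = distances[node] + 1
                queue.append(other)
    return distances


def diameter(neighbours, component):
    """Longest shortest path between any two nodes of a connected component."""
    return max(max(bfs_distances(neighbours, node).values()) for node in component)


# Toy graph (hypothetical): a chain 0-1-2-3 and a star centred on node 4.
neighbours = {
    0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2},
    4: {5, 6, 7}, 5: {4}, 6: {4}, 7: {4},
}
chain_diameter = diameter(neighbours, {0, 1, 2, 3})
star_diameter = diameter(neighbours, {4, 5, 6, 7})
```

Both toy components have four posts, but the star has the smaller diameter: everyone reposted directly from the source, which is the "broad, not deep" shape.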

Evaluation and Analysis of Previous Campaign

Seeding Strategies

Probability of Clicks

The probability that a user who has seen a post will click on the link to the site.

Probability of Donation

Evaluated as the number of donors over the number of possible donors (the number of site visitors). prob_donation gives the probability that someone donated given that they clicked on the link to the site.
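Both probabilities are simple ratios of counts, which the sketch below spells out. The counts are made-up toy numbers, and `prob_click` and `conditional_rate` are names chosen for the example; only `prob_donation` appears in the notebook.

```python
def conditional_rate(successes, trials):
    """Empirical probability of success given a trial; 0.0 when there were no trials."""
    if trials == 0:
        return 0.0
    return successes / trials


# Toy counts (hypothetical, not the campaign's real figures).
views, clicks, donors = 2000, 150, 30

prob_click = conditional_rate(clicks, views)      # P(click | saw the post)
prob_donation = conditional_rate(donors, clicks)  # P(donation | clicked the link)
```

Guarding against zero trials avoids a division error for posts that nobody viewed or clicked.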

Simulation

Here, we finally begin our simulations!

Merged Dataset

The merged_dataset is a tool for us to quickly access users and posts without excessive querying - nothing to worry about :)

The model starts right after this.

Machine Learning Regression Model for Donation Value

Model

Compartment Model under consideration.

Poster ---> Followers ---->

                        View ----> Reposts

                        View ----> Comment

                        View ----> Like

Poster ---> Link Click ---> Donation

The simulation essentially works as follows.

We achieve this using a simple Breadth First Traversal of the accounts graph, assuming that the best possible seeds have been chosen to start from. This choice of seeds effectively decides the strategy.
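A stripped-down sketch of such a breadth-first traversal is shown below. It is an assumption-laden illustration rather than the notebook's model: it only tracks views and reposts, uses a single repost probability `p_repost` for everyone, and gives a follower a fresh chance to repost each time they see the post.

```python
import random
from collections import deque


def simulate_spread(followers, seeds, p_repost, rng):
    """Breadth-first spread: each poster's followers view, and may repost in turn."""
    posters = set(seeds)
    viewers = set()
    queue = deque(seeds)
    while queue:
        poster = queue.popleft()
        for follower in followers.get(poster, []):
            viewers.add(follower)  # every follower of a poster sees the post
            if follower not in posters and rng.random() < p_repost:
                posters.add(follower)   # the follower reposts once...
                queue.append(follower)  # ...and their own followers are visited later
    return posters, viewers


# Toy accounts graph (hypothetical): maps each account to its followers.
followers = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}

all_posters, all_viewers = simulate_spread(followers, ["a"], 1.0, random.Random(0))
no_posters, no_viewers = simulate_spread(followers, ["a"], 0.0, random.Random(0))
```

Passing in the random generator keeps runs reproducible, which is convenient when the same simulation is repeated many times with different seeds.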

In case of a repost, we populate a simulated posts dataset called new_posts_data.

The number of reposts of a post is given by the out-degree of its node in the Posts graph. This is because each outgoing edge in the Posts graph represents a poster->reposter link between posts.
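Counting reposts this way amounts to counting outgoing edges per node. A small sketch, assuming the edges are available as (poster post, repost) index pairs; `repost_counts` is a name made up for the example.

```python
def repost_counts(edges):
    """Out-degree of every node that appears in a list of (source, target) edges."""
    out_degree = {}
    for source, target in edges:
        out_degree[source] = out_degree.get(source, 0) + 1
        out_degree.setdefault(target, 0)  # posts that were never reposted get 0
    return out_degree


# Toy edges (hypothetical): post 0 was reposted twice, post 1 once.
counts = repost_counts([(0, 1), (0, 2), (1, 3)])
```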

KPI Evaluation: Simulation Edition

Let's see how well we did, as compared to the original campaign.

Monte Carlo Simulation

Exporting Simulation Data